Main analysis
We present this exploratory data analysis in four sections as outlined in the introduction: artists, words, and audio features.
Artists
In this subsection, we examine the data at the artist level to answer some basic initial questions.
top_artists = df %>%
group_by(artist_base) %>%
summarize(num_singles = n()) %>%
arrange(desc(num_singles))
top_artists = top_artists[0:30,]
ggplot(top_artists, aes(x = reorder(artist_base, num_singles), y = num_singles)) +
geom_col() +
coord_flip() +
xlab('Artist') +
ylab('Number of yearly top 100 singles from 1965-2015') +
labs(title = 'Who are the most popular artists of the past 50 years?')
most_explicit_artists = df %>%
group_by(artist_base) %>%
summarize(explicitness = sum(explicit)) %>%
arrange(desc(explicitness))
most_explicit_artists = most_explicit_artists[0:30,]
ggplot(most_explicit_artists, aes(x = reorder(artist_base, explicitness), y = explicitness)) +
geom_col() +
coord_flip() +
xlab('Artist') +
ylab('Number of yearly top 100 explicit singles from 1965-2015') +
labs(title = 'Who are the most popular explicit artists of the past 50 years?')
To measure the explicitness of each artist, we initially considered averaging the binary 0/1 explicit attribute of each song by artist. However, because many artists only have 1 single in the Yearly Top 100 throughoug their career, this results in many artists having an average explicitness of 1. Thus, we fall back to simply counting the number of explicit singles per artist. Interestingly, only the top 3 most explicit artists (Eminem, Ludacris, and Drake) are also among the top 30 most popular artists. Eminem and Ludacris are famously prolific and explicit; in fact we can see that all 15 and 14, respectively, of their top singles are explicit. More generally, we also notice that the vast majority of these singles are from the hip hop and rap genres.
most_featuring_artists = df %>%
mutate(is_collab = str_detect(artist, 'feat')) %>%
group_by(artist_base) %>%
summarize(num_collaborations = sum(is_collab)) %>%
arrange(desc(num_collaborations))
most_featuring_artists = most_featuring_artists[0:20,]
p1 = ggplot(most_featuring_artists, aes(x = reorder(artist_base, num_collaborations),
y = num_collaborations)) +
geom_col() +
xlab('Main artist') +
ylab('Number of top 100 singles featuring a guest artist') +
scale_y_continuous(breaks=1:9, labels=1:9) +
coord_flip()
matches = str_match(as.list(df['artist'])$artist, 'featuring\\s(.*)')
matches = matches[, 2]
matches = matches[!is.na(matches)]
matches = as_tibble(matches)
featured_artists = matches %>%
group_by(value) %>%
summarize(num_features = n()) %>%
arrange(desc(num_features))
featured_artists = featured_artists[1:20,]
p2 = ggplot(featured_artists, aes(x = reorder(value, num_features),
y = num_features)) +
geom_col() +
xlab('Featured artist') +
ylab('Number of top 100 singles featured as guest') +
scale_y_continuous(breaks=1:12, labels=1:12) +
coord_flip()
grid.arrange(p1, p2, ncol=2)
Many top singles are the product of a collaboration between two artists, with the resulting artist attribution of the song taking the form “X featuring Y”, where X denotes the main artist and Y denotes the guest/featured artist. Above, we see that Rihanna and Chris Brown most frequently feature other artists in their top singles, while Lil Wayne and T-Pain are the most frequent guests on other artists’ top singles. Particularly interesting is the fact that while Usher, Timbaland, Santana, and David Guetta are among the 20 most “featuring” artist, they are not among even the top 20 most “featured” artists. Conversely, T-Pain, Snopp Dogg, Nicki Minah, and Will.I.Am are among the 20 most “featured” artist despite not being among the 20 most “featuring” artists.
collaborations = df %>%
mutate(is_collab = str_detect(artist, 'feat')) %>%
group_by(year) %>%
summarize(num_collaborations = sum(is_collab))
ggplot(collaborations, aes(x = year, y = num_collaborations)) +
geom_line() +
xlab('Year') +
ylab('Number of top 100 singles featuring a guest artist')
More interesting still is the rising trend in artist collaborations over time that begins during the 1990s and takes off dramatically before plateauing in the late 2000s. Additional comments about this trend are presented in the executive summary.
nunique_artists_year = df %>%
group_by(year) %>%
summarize(nunique = n_distinct(artist_base))
options(repr.plot.width = 16, repr.plot.height = 6)
ggplot(nunique_artists_year, aes(x = year, y = nunique), fill='black') +
geom_line() +
ylab('Number of unique artists')
The plot above shows artist diversity over time, with diversity defined not on the basis of race but on the number of unique artists with top 100 singles each year. The trend depicted shows a mixed picture; the most diverse year is in the early 1970s while the least diverse is in 2009 and 2010, yet the pattern remains noisy enough that more years are needed before making a more solid determination.
Audio features
In this subsection, we turn our focus to the audio features obtained from Spotify for each song, particularly how these features are related to one another as well as how they change over time. Originally, we aggregated each feature by year and plotted how the mean evolves over time. Not satisfied with the resulting loss in information by aggregating using solely the mean, we then incorporated additional aggregations such as the maximum and the minimum for each year before ultimately deciding to create box plots for each year to minimize information loss.
#Duration
spotifydf <- df%>%
group_by(year)
ggplot(spotifydf) + geom_boxplot(aes(year, duration_min, group=year))+
ggtitle("Duration over time")+
ylab("Duration (in Minutes)")+
scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).
From above, we observe a gradual increase in the durations of top 100 singles that peaks at 1990 with median song lengths just shy of 5 minutes before starting on a downward trend to under 4 minutes. This trend of declining song durations coincides with a decline in the variation of song lengths. One possible explanation is that as the music industry has become increasingly competitive, artists are forced to grab their audience’s attention as quickly as possible, and as listeners gain access to an ever-expanding catalogue of songs, their attention spans are decreasing.
#Acousticness
ggplot(spotifydf) + geom_boxplot(aes(year, acousticness, group=year)) +
ggtitle("Acousticness over time") +
ylab("Acousticness") +
scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).
Top 100 songs have been trending downward in acousticness from 1965 to 2015. This trend might also be attributed to an increasingly competitive music industry; live studio musicians and indeed recording studios themselves are much more expensive in contrast to hardware synthesizers, and later, software synthesizers on which artists can create entire songs using only a laptop computer.
#Danceability
ggplot(spotifydf) + geom_boxplot(aes(year, danceability, group=year)) +
ggtitle("Danceability over time")+
ylab("Danceability ")+
scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).
While the median danceability of songs has remained relatively stable from 1985 to 2015, there exists a noticeable increase from 1965 to 1984.
#Energy
ggplot(spotifydf) + geom_boxplot(aes(year, energy, group=year)) +
ggtitle("Energy over time")+
ylab("Energy ")+
scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).
Overall, there appears that songs are becoming more energetic and busier over time, though the signal is quite noisy. Most interestingly, from 1985 onwards, the change in median energy over time appears to even be cyclical.
#Instrumentalness
ggplot(spotifydf) + geom_boxplot(aes(year, instrumentalness, group=year)) +
ggtitle("Instrumentalness over time")+
ylab("Instrumentalness ")+
scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).
Unsurprisingly, the median instrumentalness of top 100 singles consistently stays at or near 0, however we can observe a noticeable thinning out of outliers, suggesting that instrumental music is becoming increasingly unpopular. Additionally, this decline in the instrumentalness of songs could be tied to the decline of song lengths; if songs are becoming shorter because of declining listener attention spans, then one is likely to also observe songs with increasingly shorter instrumental introductions and interludes as artists favor “getting to the point.”
#Liveness
ggplot(spotifydf) + geom_boxplot(aes(year, liveness, group=year)) +
ggtitle("Liveness over time")+
ylab("Liveness ")+
scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).
There are no discernable trends in how the distribution of songs’ liveness changes over time.
#Loudness
ggplot(spotifydf) + geom_boxplot(aes(year, loudness, group=year)) +
ggtitle("Loudness over Time")+
ylab("Loudness ")+
scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).
Top 100 singles are unquestionably becoming much louder. We expand on an explanation of the “Loudness Wars” in the executive summary.
#Speechiness
ggplot(spotifydf) + geom_boxplot(aes(year, speechiness, group=year)) +
ggtitle("Speechiness over Time")+
ylab("Speechiness ")+
scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).
Also pronounced is a dramatic increase in speechiness starting from 1990 and peaking in 2004 which may correspond with the increased popularity of rap over that period of time.
#Tempo
ggplot(spotifydf) + geom_boxplot(aes(year, tempo, group=year)) +
ggtitle("Tempo over Time")+
ylab("Tempo ")+
scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).
There are no discernable trends in how the distribution of songs’ tempos changes over time.
#Valence
ggplot(spotifydf) + geom_boxplot(aes(year, valence, group=year)) +
ggtitle("Valence over Time")+
ylab("Valence")+
scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).
There is a general trend of songs becoming increasingly less positive, perhaps reflecting the increase in cultural and economic pressure society has been experiencing.
#Verbosity
ggplot(spotifydf) + geom_boxplot(aes(year, words_per_sec, group=year)) +
ggtitle("Verbosity over Time")+
ylab("Words per second")+
scale_x_continuous(breaks = seq(1960,2020,5))
## Warning: Removed 327 rows containing non-finite values (stat_boxplot).
Corresponding to the trends involving speechiness over time, we see that the number of words heard per second in songs has increased over time. Indeed, one should expect this given the previously observed speechiness trends as it is much easier to speak quickly than to sing quickly.
# ADD parcoords and corresponding comments
To examine the correlation structure of the data’s continuous features, we create a correlation matrix and visualize it with a heatmap below. Additionally, we arrange the features so that those with the most similar correlations to other features are placed near each other, with the dendrograms on the top and left showing which features are most similar to each other.
columns = c('rank', 'year', 'acousticness', 'danceability', 'duration_min',
'energy', 'instrumentalness', 'liveness', 'loudness', 'popularity',
'speechiness', 'tempo', 'valence', 'words_per_sec')
df_cor = df[columns]
correlation = cor(df_cor, method='pearson', use='pairwise.complete.obs')
col = colorRampPalette(c('red', 'white', 'green'))(20)
heatmap(x = correlation, col = col, symm = TRUE)
Note: This heatmap below was originally created in Python using the Seaborn visualization library. In translating the visualization to R using R’s heatmap functionality, we were unable to produce a legend nor annotate the grid cells individually. We present this second, more ideal heatmap as an image in addition to the one created in R above.
stuff
The dendrogrammed correlation heatmap reveals some interesting aspects of the data’s correlation structure:
- Popularity and year are weakly correlated. This makes intuitive sense given that the audience on Spotify are more likely to listen to more current music and that people who would enjoy the older music might be less likely to listen to it on Spotify.
- Loudness and year are also weakly correlated, as previously observed.
- Loudness and energy are moderately correlated.
- Acousticness and energy are weakly anticorrelated. This could be explained by the fact that ballads and other slow music are more likely to feature acoustic instruments.
- Valance and danceability are weakly correlated. This could suggest that for the most part, people are less inclined to dance to sad or angry music.
- Words per second is weakly correlated with speechiness. Indeed, it is easier to talk fast than to sing fast.
The correlation heatmap provides guidance as to which pairs of features are worth investigating further.
ggplot(df, aes(x=acousticness, y=energy, color=year)) +
geom_point(alpha=0.5) +
ggtitle('Acousticness vs. energy')
## Warning: Removed 327 rows containing missing values (geom_point).
It appears that most songs have very low acousticness (below 0.125) and high energy (above 0.5). However, outside this dense region, as a song increases in acousticness, its energy appears to decrease quadratically. Furthermore, it appears that these less common high-acoustic/lower-energy songs are predominantly older, with the vast majority of more recent songs occupying the low-acoustic/higher-energy region. We can observe this more clearly by faceting on decade.
ggplot(df, aes(x=acousticness, y=energy)) +
geom_point(alpha=0.25) +
facet_wrap(~ decade)
## Warning: Removed 327 rows containing missing values (geom_point).
The facetted scatterplots confirm more clearly that indeed songs are trending towards lower acousticness and higher energy with each passing decade (which was also observed in the time series box plots). It should be noted that because the time range of the dataset is from 1965-2015, the facets for the 1960s and 2010s have half as many data points as the facets for other decades.
ggplot(df, aes(x=danceability, y=valence, color=year)) +
geom_point(alpha=0.5) +
ggtitle('Danceability vs valence')
## Warning: Removed 327 rows containing missing values (geom_point).
ggplot(df, aes(x=speechiness, y=words_per_sec, color=year)) +
geom_point(alpha=0.5) +
ggtitle('Speechiness vs verbosity')
## Warning: Removed 327 rows containing missing values (geom_point).
ggplot(df, aes(x=speechiness, y=words_per_sec)) +
geom_point(alpha=0.25) +
facet_wrap(~ decade)
## Warning: Removed 327 rows containing missing values (geom_point).
Above, we observe that speechiness and verbosity (words per second) are indeed positively correlated, albeit moderately and that top 100 singles have become increasingly more speechy and verbose starting from the 1990s when rap and hip hop first gained mainstream appeal.
p1 = ggplot(df, aes(x=energy, y=loudness, color=year)) +
geom_point(alpha=0.5) +
ggtitle('Energy vs. loudness')
p2 = ggplot(df, aes(x=energy, y=loudness)) +
geom_point() +
facet_wrap(~ decade)
grid.arrange(p2, p1, ncol = 2)
## Warning: Removed 327 rows containing missing values (geom_point).
## Warning: Removed 327 rows containing missing values (geom_point).